Here is my attempt to replicate the PCA from Kosack et al. and then overlay Annalisa/Rachel’s data, which I’ll refer to as our data from here on.
If you can’t be bothered to read beyond this bit, the messages are:
Therefore my conclusions are that our data sit within the Kosack Skin cluster along with the Nerve sample, but our samples are all more or less identical in that projection. When using just our data, each cell line is different from the others.
I’m not sure how justifiable it is to subset the data further, especially when considering that the Kosack DFT1 cell line is an outlier.
From the Kosack spreadsheet:
“Differential analysis of 4981 proteins detected across more than 80% of the samples. The protein abundance data was normalized by variance stabilizing transformation. Missing values are imputed on normalized abundance values with the k Nearest Neighbors (kNN) algorithm implemented in the R Bioconductor package impute. Differential analysis, tumor versus the healthy, was performed between biopsies (excluding the tumor cell line) using the limma Bioconductor package.”
In the methods:
“Principal component analysis was performed on 3894 out of 6672 proteins quantified in all 19 samples.”
As usual when I try to replicate something, I get different numbers and I don’t know if it’s me or them! I found 3514 proteins in their spreadsheet that were quantified in all 19 samples, rather than the 3894 stated in the methods.
Our data contains 2772 proteins quantified in three samples (SALEM, DFT1 and DFT2), where each value is the average of three replicates. The replicate-level data has 3628 proteins quantified, but I can’t tell how this was reduced to 2772, so I can’t just use the replicates here.
Here I present:
The approach here is that each protein represents a variable and we are representing the samples (each cell type, devil tumour etc.) in terms of the relative abundance of their proteins. The variation (and similarity) between samples is represented by their distance and direction on the plot. The data itself is not changed, just the representation, but this is how PCA can act as a form of clustering.
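As a minimal sketch of that idea, the toy example below runs a PCA on made-up numbers (not the real abundances) with samples as rows and proteins as columns. The object names `abund`, `pca` and `scores` are only for illustration. It checks that the distances between samples are the same before and after the rotation, which is what "the data itself is not changed" means in practice.

```r
# Toy data: 4 samples (rows) x 5 proteins (columns); all values are made up
set.seed(1)
abund <- matrix(rnorm(20), nrow = 4,
                dimnames = list(paste0("sample_", 1:4),
                                paste0("protein_", 1:5)))

# Centre each protein, then rotate the samples onto the principal components
pca <- prcomp(abund, center = TRUE, scale. = FALSE)

# Sample coordinates in PC space: one row per sample, one column per PC
scores <- pca$x

# Distances between samples in protein space and in the full PC space
d_original <- dist(abund)
d_pcs      <- dist(scores)

# Expected to be TRUE up to floating-point tolerance
all.equal(as.vector(d_original), as.vector(d_pcs))

# Proportion of variance carried by each component
summary(pca)$importance["Proportion of Variance", ]
```

Plotting only PC1 and PC2 throws away the remaining components, so distances on the 2D plot are an approximation of the full distances.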
We can combine datasets in two different ways:
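library(tidyverse) # provides %>% and the dplyr/tidyr verbs used below
library(janitor)   # provides clean_names()

# Our data: first sheet of the workbook
dat <- readxl::read_excel("dft2_proteins_300718.xlsx", sheet = 1) %>%
  clean_names()

# Kosack et al. supplementary table: second sheet, "NA" strings read as missing
cemm <- readxl::read_excel("SuppTableS4_Proteomics.xlsx",
                           sheet = 2,
                           skip = 0,
                           na = "NA") %>%
  clean_names()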
In our data:
# Count prots
dat %>% summarise(Number_of_proteins = n())
## # A tibble: 1 x 1
## Number_of_proteins
## <int>
## 1 2772
In the Kosack data:
Keep only the 19 samples of interest and the proteins that were quantified in every one of them. Drop duplicate protein names. Count proteins.
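cemm_tidy <- cemm %>%
  select(gene_name = gene, 9:27) %>%          # gene name plus the 19 sample columns
  distinct(gene_name, .keep_all = TRUE) %>%   # keep the first row for each gene name
  drop_na()                                   # keep proteins quantified in all samples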
# Count prots
cemm_tidy %>% summarise(Number_of_proteins = n())
## # A tibble: 1 x 1
## Number_of_proteins
## <int>
## 1 3514
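The count depends on both filtering steps, so one way to see what each contributes is to count rows after each step separately. The sketch below does this on a small made-up table (`toy`, with hypothetical sample columns `s1` and `s2`), not on the real spreadsheet, and it is not a claim about where the difference from the published number comes from.

```r
library(dplyr)
library(tidyr)

# Made-up table standing in for the supplementary sheet
toy <- tibble::tibble(
  gene_name = c("A", "A", "B", "C", "D"),
  s1 = c(1.0, 1.1, 2.0, NA, 4.0),
  s2 = c(1.2, 1.3, 2.1, 3.0, 4.1)
)

# All rows, before any filtering
n_rows <- nrow(toy)

# Rows with no missing values, duplicates still included
n_complete <- toy %>% drop_na() %>% nrow()

# One row per gene name, missing values still included
n_unique <- toy %>% distinct(gene_name, .keep_all = TRUE) %>% nrow()

# Both steps, in the same order as the pipeline above
n_both <- toy %>%
  distinct(gene_name, .keep_all = TRUE) %>%
  drop_na() %>%
  nrow()

c(rows = n_rows, complete = n_complete, unique = n_unique, both = n_both)
```

Note that `distinct()` keeps the first row for each name, so if that row has a missing value and a later duplicate does not, the protein is dropped by `drop_na()`. Swapping the order of the two steps can therefore change the count.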
Make a joint dataset of shared proteins:
# inner_join() with no `by` argument joins on every column name the two tables share
dat_join <- dat %>%
  select(-ensembl_gene_id, -refseq) %>%
  inner_join(cemm_tidy)
# Count shared prots
dat_join %>% summarise(Number_of_proteins = n())
## # A tibble: 1 x 1
## Number_of_proteins
## <int>
## 1 1400
PCA with the Kosack data of 3514 proteins and 19 samples.
I can kind of replicate the Kosack figure, but not quite!
Figure 2 from Kosack et al.
PCA of 22 samples and 1400 proteins from a combined dataset
To project our data onto the Kosack data, I first need to redo the PCA on the Kosack data alone, using only the 1400 shared proteins.
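For reference, one common way to project new samples onto an existing PCA in base R is `predict()` on a `prcomp` object. This is a generic sketch on made-up matrices (`ref` standing in for the Kosack samples, `new` for ours) and is not necessarily how the projections in this report were made.

```r
set.seed(2)
proteins <- paste0("protein_", 1:6)

# Made-up reference samples (rows) measured on 6 proteins (columns)
ref <- matrix(rnorm(5 * 6), nrow = 5,
              dimnames = list(paste0("ref_", 1:5), proteins))

# Made-up new samples measured on the same 6 proteins
new <- matrix(rnorm(2 * 6), nrow = 2,
              dimnames = list(paste0("new_", 1:2), proteins))

# Fit the PCA on the reference samples only
pca_ref <- prcomp(ref, center = TRUE, scale. = FALSE)

# predict() centres the new samples with the reference means,
# then applies the reference rotation
new_scores <- predict(pca_ref, newdata = new)

# The same projection written out by hand
manual <- sweep(new, 2, pca_ref$center) %*% pca_ref$rotation

# Expected to be TRUE up to floating-point tolerance
all.equal(new_scores, manual, check.attributes = FALSE)
```

The new samples do not influence the components here; they are only placed in the space defined by the reference samples. Both matrices must have the same proteins as columns, which is why the shared set is needed.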
I’ve done this two ways: